Factorized Topic Models 



Cheng Zhang Carl Henrik Ek Hedvig Kjellstom 

CVAP/CAS, KTH CVAP/CAS, KTH CVAP/CAS, KTH 

Stockholm, Sweden Stockholm, Sweden Stockholm, Sweden 

chengz@kth.se chek@kth.se hedvig@kth.se 


Abstract 

In this paper we present a modification to a latent topic model, which makes the 
model exploit supervision to produce a factorized representation of the observed 
data. The structured parameterization separately encodes variance that is shared 
between classes from variance that is private to each class by the introduction of a 
new prior over the topic space. The approach allows for a more efficient inference 
and provides an intuitive interpretation of the data in terms of an informative signal 
together with structured noise. The factorized representation is shown to enhance 
inference performance for image, text, and video classification. 

1 Introduction 

Representing data in terms of latent variables is an important tool in many applications. A generative 
latent variable model provides a parameterization that encodes the variations in the observed data, 
relating them to an underlying representation, e.g., a set of classes, using some kind of mapping. It 
is important to note that any modeling task is inherently ill-conditioned as there exists an infinite 
number of combinations of mappings and parameterizations that could have generated the data. 
To that end, we choose different models, based on different assumptions and preferences that will 
induce different representations, motivated by how well they fit the data and for what purpose we 
wish to use the representation. 

Inference in generative models meets difficulties if the variations in the observed data are not 
representative of the variations in the underlying state to be inferred. As an example (see Figure 7), 
consider a visual animal classifier, trained with, e.g., SIFT [20] features extracted from training 
images of horses, cows and cats with a variation of fur texture. The task is now to classify an image of 
a spotted horse. Based on the features, which will mostly pick up the fur texture, the classifier will 
be unsure of the class, since there are spotted horses, cows and cats in the training data. The core of 
the problem is that fur texture is a weak cue to animal class given this data: Horses, cows and cats 
can all be red, spotted, brown, black and grey. Shape is on the other hand a strong cue to distinguish 
between these classes. However, the visual features will mostly capture texture information - the 
shape information (signal) is "hidden" among the significantly richer texture information (structured 
noise) making up the dominant part of the variation in the data. 

In this paper we address this issue by explicitly factorizing the data into a structured noise part, 
whose variations are shared between all classes, and a signal part, whose variations are characteristic 
of a certain class. For our purposes, it is very useful to think about data as composed of topics. 
Probabilistic topic models ||23l Qj] HI 13 model a data example as a collection of words (in the case 
of images, visual words), each sampled from a latent distribution of topics. The topics can be thought 
of as different aspects of the data - a topic model trained with the data in our animal example above 
might model one topic for shape and another for fur texture, and a certain data instance is modeled 
as a combination of a certain shape and a certain texture. 

Our approach is to encourage the topics to assume either a very high correlation or a very low 
correlation with class. The class can then be inferred using only the class-specific topics, while the 
shared topics are used to explain away the aspects of the data that are not interesting to this particular 
inference problem. We present a variant of a Latent Dirichlet Allocation (LDA) [2] model which is 
able to model the signal and structured noise separately from the data. This new model is trained 
using a factorizing prior, which partitions the topic space into a private signal part and a shared noise 
part. The model is described in Section 3. 

Experiments in Section 4 show that the proposed model outperforms both the standard LDA and 
a supervised variant, SLDA [3], on classification of images, text, and video. Furthermore, the 
explicit noise model increases the sparsity of the topic representation. This is encouraging for two 
reasons. Firstly, it indicates that the factorized LDA model is a better model of class than the 
unrestricted LDA, enabling better performance on any inference or data synthesis task. Secondly, 
it enables a more economical data representation in terms of storage and computation, which is crucial for 
applications with very large data sets. The factorization method can be applied to other topic models 
as well, and the sparse factorized topic representation is beneficial not only for classification, as 
shown here, but also for synthesis [5 1, ambiguity modeling G), and domain transfer ETl . 

2 Related Work 

In this section we will create a context for the model that we are about to propose by relating it 
to factorized latent variable models in general and topic models in particular. Providing a complete 
review of either is beyond the scope of this paper, so we focus here on only the most relevant 
subset of work needed to motivate the model. 

The motivation for learning a latent variable model is to exploit the structure of the new representa- 
tion to perform tasks such as synthesis or prediction of novel data, or to ease an association task such 
as classification. For continuous observations, several classic algorithms such as Principal Compo- 
nent Analysis (PCA) and Canonical Correlation Analysis (CCA) can be interpreted as latent variable 
models [Q] EH 021127]]. Another modeling scenario is when observations are provided in the form of 
collections of discrete entities. An example is text data where a document consists of a collection of 
words. One approach to encode such data is using a latent representation that groups words in terms 
of topics. Several approaches for automatically learning topics from data have been suggested in the 
literature. A first proposal of a generative topic model was Probabilistic Latent Semantic Indexing 
(pLSI) [11]. The model represents each document as a mixture of topics. The next important 
development was a Bayesian version of pLSI, obtained by adding a prior to the mixture weights. This was 
done by adding a Dirichlet layer, and the resulting model is referred to as Latent Dirichlet Allocation (LDA) [2]. 

Central to the work presented in this paper is a specific latent structure simultaneously proposed 
by several authors J7] [13] [14] QjO. Given multiple observation modalities of a single underlying 
state, the purpose of these models is to learn a representation that separately encodes the modality- 
independent variance from the modality-dependent. The latent representation is factorized such 
that the modality-independent and modality-dependent variations are encoded in separate subspaces [6]. This 
factorization has an intuitive interpretation in that the private space encodes variations that exist in 
only one modality and therefore represents the ambiguities between the 
modalities Q. 

In this paper we will exploit a similar type of factorization within a topic model, but instead of 
exploiting correlations between observation modalities, we employ a single observation modality and 
a class label associated with each observation. Specifically, our approach will encourage a 
factorization relating to class, such that the topics will be split into those encoding within-class variations 
and those encoding between-class variations. Such a factorization becomes interesting for 
inferring the class label from unseen data; the class-shared topics can be considered as representing 
"structured noise" while only the private class topics contain the information relevant for class inference. 

However, it is not easy to directly transfer the above factorization, formulated between modalities 
and described for continuous data, to topic models, which are inherently discrete. Results have 
been presented [12, 25, 28] for the case of two conditionally independent observation modalities, 
addressing the image and text cross-modal multimedia retrieval problem with topic representation. 
In [12], a model that can be seen as a Markov random field of LDA topic models is presented. 
The topic distribution of each topic model affects the underlying topic spaces of other topic models 
connected to that model through the Markov random field. Further, in [25], CCA is applied to the 
topic space of the text data, which in turn has been learned from LDA, and the image feature space. 
LDA and CCA are used as two separate steps. In contrast, [28] uses a Hierarchical Dirichlet 
Process (HDP) based method which has a complexity selection property. It takes the topics that 
describe variance in only one modality as the private space, which explains away the information that 
cannot be matched between different modalities. This is an extension of [22] to multiple modalities; 
hence it cannot be generalized to other topic models, such as LDA or pLSI, and it cannot be used 
to model the private and shared information with only one modality. 

(a) LDA with class label [8] (b) Our factorized LDA 

Figure 1: Graphic representation of LDA structures. The notation in (b) is adopted from Jia et al. [12]. 



In contrast to [12, 25, 28], which need to model the shared topics and private topics in the joint 
topic space across different observation modalities, our factorization takes place over one modality 
across different classes, where the structured noise is modeled in the class-shared topics and the 
signal is modeled in class-private topics. Furthermore, and importantly, our approach is flexible 
and can be easily transferred to any type of topic model. Our choice of LDA stems from the fact 
that it has previously been successfully applied to a large range of data and has desirable sparsity 
properties that make for an efficient model. 

Topic models, and the LDA model in particular, are motivated by the benefit of representations that 
are sparse in terms of the distribution of topics for each document. In addition to this, the model 
we are about to present aims to encourage a specific structure of the topics themselves. This notion 
is not new and has been proposed by several other authors. In [9] the topics are represented as 
combinations of a small number of latent components, leading to a more compact model. In 
[29] each topic is constrained by the words in the vocabulary. However, none of these models 
aim to learn a topic structure that is related to class. 



3 Model 

As described in the introduction, we add factorization to a model that describes variations of data 
in terms of a set of latent topics. We seek a structured representation that encodes topics containing 
within-class, or class private, variations separately from those containing variations that are shared 
between the classes. We apply our factorization framework to an adaptation of LDA, which 
incorporates additional class information to recover such a factorized latent space. In this section, 
the traditional LDA model [2] is first revisited, followed by the description of our factorized topic 
model. 



3.1 LDA Revisited 

Formally, a document $\mathbf{w}$ consists of a collection of words $\mathbf{w} = [w_1, \ldots, w_N]$ from a vocabulary 
indexed by $\{1, \ldots, V\}$. Within a topic model, each document of $N$ words is described as a mixture 
of $K$ topics such that each word is associated with a specific topic: $\mathbf{z} = [z_1, \ldots, z_N]$, where 
$z_n \in \{1, \ldots, K\}$. The mixture is defined as 

$$p(\mathbf{w} \mid \mathbf{z}, \beta) = \prod_{n=1}^{N} \sum_{k=1}^{K} p(w_n \mid z_n, \beta_k) \qquad (1)$$

where $\beta_k$ is the distribution over the vocabulary for topic $k$. The novelty, and the reason for the 
success, of the LDA model is how the topics $\mathbf{z}$ and the topic vocabulary $\beta$ are constructed within the 
framework. The underpinning intuition is that the topics should present a compact representation 
with $K \ll N$, and that the structure of the topics should be sparse so as to achieve a robust and 
interpretable model. Assuming the topics $\mathbf{z}$ to be governed by a multinomial distribution, 
$z \sim \mathrm{Multi}(z \mid \theta)$, sparsity can be achieved by choosing the parameters $\theta$ as governed by a Dirichlet 
distribution, $\theta \sim \mathrm{Dir}(\theta \mid \alpha)$. By the same motivation, a Dirichlet prior is placed over the 
topic-vocabulary distribution, $\beta \sim \mathrm{Dir}(\beta \mid \pi)$. As the Dirichlet is conjugate to the multinomial distribution, 
the marginal likelihood can be reached analytically by combining the likelihood with the prior and 
performing the integration, 

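$$p(\mathbf{w} \mid \alpha, \pi) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) p(\beta \mid \pi)\, d\theta \qquad (2)$$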



from which the parameters of the model can be learned. 
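
The generative process behind Equations 1 and 2 can be illustrated by drawing one synthetic document. The following is a minimal sketch and not the authors' implementation: the function name `sample_document`, the use of NumPy, and the sizes `K`, `V` and the document length are assumptions chosen only for illustration.

```python
import numpy as np

def sample_document(n_words, alpha, beta, rng):
    """Draw one document from the LDA generative process (illustrative only)."""
    theta = rng.dirichlet(alpha)                        # theta ~ Dir(alpha)
    z = rng.choice(len(alpha), size=n_words, p=theta)   # z_n ~ Multi(theta)
    # w_n ~ Multi(beta_{z_n}): each word is drawn from the vocabulary
    # distribution of its assigned topic.
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])
    return theta, z, w

rng = np.random.default_rng(0)
K, V = 3, 10                                    # hypothetical number of topics and vocabulary size
alpha = np.full(K, 0.1)                         # sparse Dirichlet prior over topic proportions
beta = rng.dirichlet(np.full(V, 0.2), size=K)   # beta_k ~ Dir(pi), one row per topic
theta, z, w = sample_document(50, alpha, beta, rng)
```

With small values of $\alpha$, most of the mass of `theta` typically falls on a few topics, which is the sparsity property discussed above.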

One way of incorporating class information within the LDA framework was suggested in [8], where 
the use of a class-dependent topic distribution was proposed. This was implemented by using the 
class variable $c$ as a "switch" over the $C$ classes: $p(\theta \mid \alpha, c) = \prod_{j=1}^{C} \mathrm{Dir}(\theta \mid \alpha_j)^{\delta_{cj}}$, where $\delta_{ij}$ is the Kronecker delta 
function. Using this model, the class can be inferred for a new document $\mathbf{w}^*$ through a maximum 
likelihood procedure, $\hat{c} = \arg\max_c p(\mathbf{w}^* \mid \alpha, \pi, c)$ [8]. 

In this paper we take inspiration from the work presented in [8]. However, we choose to incorporate 
the class information in a slightly different manner. Specifically, we use a factorizing prior over the 
topic distribution, which firstly encourages sparsity, and secondly introduces a preference for a 
class-conditioned structure, such that separate topics encode within-class variations and between-class 
variations in the data. Thus, the model we propose has a stronger class dependency compared 
to [8]. We will now proceed to describe and motivate the relevance of this class dependency. 

3.2 Factorized Topic Model 

As motivated in Section 1, our idea is to separate the topic space into two parts, where the 
class-private part explains the class-dependent information (signal) and the shared part explains the 
class-independent information (structured noise). To achieve this we introduce an additional prior $p(\theta)$ to 
the model presented in [8]. This will encourage a factorized structure such that the $K$ topics can be 
"softly" split into $K_p$ class-private topics and $K_s$ shared topics, where $K_p + K_s = K$. The advantage 
of such a structured topic space is that it will be more compact than a regular model; all aspects of the 
data that correlate with class will be pushed into the class-private part of the topic space. Since the 
other, class-shared, part of the topic space will then only contain noise, the class of a new document 
$\mathbf{w}^*$ will in effect be inferred using only the class-private part. Further, in our model, we will use 
the same sparsity prior $\alpha$ over the topics for all classes. This removes the additional flexibility of 
allowing a different topic sparsity for each class - which can be relevant in certain special cases - 
but the gain is a more robust model with fewer free parameters, requiring less training data. 

In the following, let $\theta^{class}$ be the topic distributions of all classes, obtained by marginalizing over 
class. Its rows are defined as $\theta^{class}_c \propto \sum_{m=1}^{M} \theta_m \delta_{c_m c}$, where $c_m$ indicates the class label of the 
$m$-th document and $\delta$ the Kronecker delta function. Figure 14 gives an illustration of how $\theta^{class}$ is 
formed from $\theta$. Examples of class distributions can be seen in Figures 3, 4, 5 and 6. 
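
The construction of the class-level topic distribution can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `class_topic_distribution` is hypothetical, `theta` is assumed to be a documents-by-topics array of per-document topic proportions, and the proportionality in the definition is resolved here by normalizing each row to sum to one.

```python
import numpy as np

def class_topic_distribution(theta, labels, n_classes):
    """Sum per-document topic proportions within each class, then normalize rows."""
    theta = np.asarray(theta, dtype=float)
    theta_class = np.zeros((n_classes, theta.shape[1]))
    # Row c accumulates theta_m for every document m whose label c_m equals c.
    np.add.at(theta_class, np.asarray(labels), theta)
    # Assumption: rows are normalized; every class must have at least one document.
    return theta_class / theta_class.sum(axis=1, keepdims=True)

theta = np.array([[0.8, 0.2],
                  [0.6, 0.4],
                  [0.1, 0.9]])      # three hypothetical documents, two topics
labels = np.array([0, 0, 1])        # hypothetical class labels c_m
theta_class = class_topic_distribution(theta, labels, n_classes=2)
```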

Intuitively, the private topics would concentrate to a certain class in $\theta^{class}$, while the shared 
topics would be more spread among all classes (more uniformly distributed over a column in $\theta^{class}$). 
Information entropy, widely used in different fields [19] EH, provides a good measurement of this 
property. In this case, we employ an entropy-like measure $H(k)$ over $\theta^{class}$ for each topic $k$: 

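$$H(k) = -\frac{1}{\log C} \sum_{c=1}^{C} \frac{\theta^{class}_{c,k}}{\sum_{c'=1}^{C} \theta^{class}_{c',k}} \log \left( \frac{\theta^{class}_{c,k}}{\sum_{c'=1}^{C} \theta^{class}_{c',k}} \right) \qquad (3)$$

where $\theta^{class}_{c,k}$ is the element in row $c$ and column $k$ of $\theta^{class}$. $H(k) \in [0, 1]$: it is 0 if all the probability in 
topic $k$ is concentrated in one class, and 1 if all classes are equally probable to contain topic $k$. 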
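
As a concrete check of the two extreme cases, the measure can be computed numerically. The sketch below reads $H(k)$ as the entropy of each column of the class-by-topic matrix after normalizing that column to sum to one, divided by $\log C$ so that it lies in $[0, 1]$; the function name `topic_class_entropy` and the example matrix are assumptions made for illustration.

```python
import numpy as np

def topic_class_entropy(theta_class):
    """Normalized entropy over classes for each topic (column) of theta_class."""
    theta_class = np.asarray(theta_class, dtype=float)
    n_classes = theta_class.shape[0]
    # Normalize each column so that it is a distribution over classes.
    p = theta_class / theta_class.sum(axis=0, keepdims=True)
    # Use the convention 0 * log(0) = 0 by replacing zeros with ones inside the log.
    log_p = np.log(np.where(p > 0, p, 1.0))
    return -(p * log_p).sum(axis=0) / np.log(n_classes)

# Hypothetical example: topics 0 and 1 are private to one class each,
# topic 2 is shared equally between the two classes.
theta_class = np.array([[0.6, 0.0, 0.4],
                        [0.0, 0.6, 0.4]])
H = topic_class_entropy(theta_class)
```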

To split the topics into a private and shared part, we wish the prior $p(\theta \mid k)$ to encourage topics $k$ to 
either have a low $H(k)$ (be very class-specific) or high $H(k)$ (be very class-unspecific). Hence, we 
introduce a function $A(k)$ (Figure 8(a)) as: 

$$A(k) = H(k)^2 - H(k) + 1 . \qquad (4)$$

The prior is defined as: 

$$p(\theta) \propto \prod_{k=1}^{K} A(k) . \qquad (5)$$

This prior thus treats each column of $\theta^{class}$ independently, as illustrated in Figure 14. The 
relation between this prior, the Dirichlet prior $p(\theta \mid \alpha, c)$ and other priors in the literature, e.g., [flZl, is 
discussed in Appendix B. 
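
The shape of the factorizing prior is easy to verify numerically. The following sketch evaluates $A(k) = H(k)^2 - H(k) + 1$ and the unnormalized log-prior obtained by summing $\log A(k)$ over topics; the function names `factor_a` and `log_factorizing_prior` and the example entropy values are hypothetical.

```python
import numpy as np

def factor_a(h):
    """A(k) = H(k)^2 - H(k) + 1, evaluated elementwise."""
    h = np.asarray(h, dtype=float)
    return h ** 2 - h + 1.0

def log_factorizing_prior(h):
    """Unnormalized log-prior: sum over topics of log A(k)."""
    return float(np.sum(np.log(factor_a(h))))

h = np.array([0.0, 0.5, 1.0])   # hypothetical entropy values for three topics
a = factor_a(h)                 # largest at the two extremes, smallest at H = 0.5
```

The factor equals 1 at $H = 0$ and $H = 1$ and drops to 0.75 at $H = 0.5$, so topics that are neither clearly private nor clearly shared are penalized.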

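With the additional prior (Figure 1(b)), the generative model becomes: 

$$p(\mathbf{w} \mid \alpha, \pi, c) = \int p(\theta \mid \alpha, c)\, p(\theta) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) p(\beta \mid \pi)\, d\theta . \qquad (6)$$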



Learning. We use Gibbs sampling for learning the parameters of the model, more specifically, 
collapsed Gibbs sampling [10] in the same manner as [12]. The factorizing prior presents itself 
in the learning as an additional factor in the objective function over $\mathbf{z}$, compared to the original 
LDA model. The learning procedure is further detailed in Appendix B. It should be noted that 
the factorizing prior in Equation 5 is independent of the type of learning procedure - the model in 
Equation 6 can also be trained using, e.g., a variational method. 

When training the model, the topics are initialized randomly, which means that they all have an $H$ 
close to 1. During Gibbs sampling, it would be very unlikely to find a topic with low $H$, given 
the bimodality of $A$ in Equation 4. To address this problem, we introduce an "auto-annealing" 
procedure, where $A$ is replaced with a dynamic cooling function starting off by encouraging low 
$H$ only, and gradually encouraging high $H$ more and more, as the average $H$ decreases (i.e., when 
some topics have found a class-specific state). Hence, $A$ is changed to a dynamic function 

$$A(k) = H(k)^2 - 2\bar{H} H(k) + 1 \qquad (7)$$

where the average $H$, $\bar{H} = \sum_{k=1}^{K} H(k)/K$, is used as an annealing parameter in the function. 
Figure 8(b) shows the dynamic $A$ function for three different values of $\bar{H}$. 
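
The behaviour of the dynamic function can be checked at its two reference points. In the sketch below, which is only an illustration with a hypothetical function name `annealed_a`, the annealing parameter defaults to the mean of the supplied entropy values. With an average entropy of 1 the function reduces to $(1 - H(k))^2$, which rewards low entropy only; with an average entropy of 0.5 it coincides with the static function of Equation 4.

```python
import numpy as np

def annealed_a(h, h_bar=None):
    """Dynamic A(k) = H(k)^2 - 2 * H_bar * H(k) + 1, evaluated elementwise."""
    h = np.asarray(h, dtype=float)
    if h_bar is None:
        h_bar = h.mean()          # average entropy over all topics
    return h ** 2 - 2.0 * h_bar * h + 1.0

h = np.array([0.0, 0.5, 1.0])              # hypothetical entropy values
a_start = annealed_a(h, h_bar=1.0)         # early training: only low H is encouraged
a_balanced = annealed_a(h, h_bar=0.5)      # later: low and high H are both encouraged
```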

As with other annealing procedures, the "auto-annealing" procedure means that the factorizing prior 
$p(\theta)$ changes in each step of the iterative learning procedure. In a normal annealing procedure, this 
change would be actuated by changing the annealing parameter. Here, $\bar{H}$ can be thought of as an 
autonomous annealing parameter since it converges automatically to a value reflecting the fraction 
of the class-dependent versus class-independent variation in the data. For example, the text data set 
(Figure 4) has a lower $\bar{H}$ than the natural scene dataset (Figure 5). 

Segmenting the topic space. When the model has been trained we can evaluate the structure of 
the learned topic space by computing $H(k)$ for each topic $k$. We consider topics with low $H$ as 
class-dependent, while topics with high $H$ are considered class-independent. As such the topic space 
can be "softly" segmented and interpreted in a class-conditioned manner. As an example, the words 
building up the shared topics can be considered as stop words. In text processing, there is usually 
a standard stop word list, which can be used to pre-process the text. These stop words 
are predefined, for example, "the", "at", etc. However, they sometimes also provide class-relevant 
information, for example, some topics are more location-dependent or have more nouns. On the other 
hand, there are words, like "learning", "performance", etc., which do not carry much information in, 
say, a machine learning conference corpus. In our model, we automatically learn the real stop words 
for the given domain. Furthermore, while it is easy to predefine the stop words in text data, this 
problem becomes much more challenging in computer vision applications. The "stop-visual-words" 
are ill-defined and much less intuitive to find, so an algorithm which automatically learns them, 
such as the one we propose, is very beneficial. We would like to emphasize again that there is still 
only one topic space; no hard splitting or removal of topics is done, neither for learning nor for 
inference. 
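
For inspection purposes, the soft segmentation can be made explicit by sorting the topics by their entropy and marking a cut-off. The sketch below is an assumption-laden illustration: the function name `segment_topics` and the threshold of 0.5 are hypothetical, no threshold is specified in the text, and the split is only a way of reading the learned topic space; it plays no role in learning or inference.

```python
import numpy as np

def segment_topics(h, threshold=0.5):
    """Return topic indices sorted by entropy, split at a hypothetical threshold."""
    h = np.asarray(h, dtype=float)
    order = np.argsort(h)                     # most class-specific topics first
    private = order[h[order] < threshold]     # low H: class-dependent topics
    shared = order[h[order] >= threshold]     # high H: class-independent topics
    return private, shared

h = np.array([0.9, 0.1, 0.95, 0.3])           # hypothetical entropy per topic
private, shared = segment_topics(h)
```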



4 Experiments 

The proposed model is evaluated on four different classification tasks, and compared to two baselines 
consisting of a regular LDA model with class label [8], and a model with stronger class-supervision 
in the topic learning, SLDA [3]. 

4.1 Object Classification 

Figure 2: All the instances in the toy object dataset. 

Figure 3: Toy object dataset. (a) Regular LDA topic distribution marginalized over class, $\theta^{class}$, topics sorted 
in ascending order of class-specificity. (b) SLDA topic distribution marginalized over class, $\theta^{class}$, topics sorted 
in ascending order of class-specificity. (c) Factorized LDA topic distribution marginalized over class, $\theta^{class}$, 
topics sorted in ascending order of class-specificity, red line indicating partition between $\theta^p$ and $\theta^s$. 

We first demonstrate how the factorization works using a toy dataset. The dataset, shown in 
Figure 2, is constructed to have a very high degree of structured noise. There are four object classes: 
bulb, car, duck, and mug. All 8 instances of a certain class have the same shape and image 
location. However, there is a very high intra-class variability in foreground and background texture. 
Furthermore, all four classes contain the exact same foreground-background texture combinations. 
Thus, the texture (which will dominate the variation among features from any visual extractor) can 
be regarded as structured noise, while the true signal relates to shape. The properties of this dataset 
can also be found to some extent in natural images: most realistic image and object classes display 
large intra-class appearance variation, and different classes share appearance aspects. Furthermore, 
the backgrounds in natural scenes are often complex and varying, introducing even more variation 
among training data for a class. 

SIFT features on two different scales are densely extracted from all images, and a 64-word 
vocabulary is learned in which all SIFT features are represented. Thus, each image is represented by a bag 
of visual words in this vocabulary. 
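
Once every feature has been assigned to its vocabulary entry, the bag-of-visual-words representation of an image is simply a histogram of word indices. A minimal sketch, assuming the feature extraction and vocabulary assignment have already produced integer word indices (the function name `bag_of_words` and the example indices are hypothetical):

```python
import numpy as np

def bag_of_words(word_ids, vocab_size):
    """Count how often each vocabulary entry occurs in one image or document."""
    return np.bincount(np.asarray(word_ids, dtype=int), minlength=vocab_size)

# Hypothetical word indices for the features of one image, 64-word vocabulary.
counts = bag_of_words([3, 3, 0, 63], vocab_size=64)
```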

The experiment is performed in a hold-one-out manner, where each image in turn is classified using 
a model trained on the other 31 images. In the following, "regular LDA" refers to the 
LDA with upstream supervision presented in [8], but trained using Gibbs sampling in the same way 
as our model, with the same value of $\alpha$ for all documents. "SLDA" refers to the more strongly 
supervised LDA variant with downstream supervision presented in [3], implemented by Blei et al. 

Our proposed factorized LDA, as well as regular LDA and SLDA, are trained with 15 topics, 
$\alpha = 0.1$ and $\pi = 0.2$. The classification performance for each class is found by averaging over the 
performances for the 8 images of that class. 

It should be noted that the test image always will have a texture that is different from the training 
images of that class. However, the same texture can be found in other classes. A classifier that 
tries to explain all variation in the data in terms of class variation will therefore have difficulties 
in modeling this data set; a regular LDA or SLDA model trained with this data will be forced to 
represent texture as well as shape in the same topics, since the Dirichlet prior will promote topic 
sparsity. Thus, very few topics will purely represent one class, as shown in Figures 3(a) and 3(b). 

However, our model, which explicitly factorizes the topics into those private to a certain class and 
those shared between all classes, will allow the relevant shape variation to be represented separately 
from the texture variation, which will just confuse the classification in this case. Figure 3(c) shows 
the factorized topic distribution; it is clear that the topics in $\theta^p$ are private to a certain class, while the 
noise topics in $\theta^s$ are shared equally over all classes; all the structured noise has thus been pushed 
into $\theta^s$. Thus, even though the full topic space is used for classification, it is effectively only based 
on $\theta^p$, while the shared topics $\theta^s$ (right of the red line in Figure 3(c)) are effectively disregarded in 
the classification since they appear with equal probability in all classes. Figure 9 shows examples 
of visual words in the topics of our factorized LDA. 

Figure 4: Reuters 21578 R8 dataset. (a) Regular LDA topic distribution marginalized over class, $\theta^{class}$, topics 
sorted in ascending order of class-specificity. (b) SLDA topic distribution marginalized over class, $\theta^{class}$, topics 
sorted in ascending order of class-specificity. (c) Factorized LDA topic distribution marginalized over class, 
$\theta^{class}$, topics sorted in ascending order of class-specificity, red line indicating partition between $\theta^p$ and $\theta^s$. 

Figure 5: Natural scene dataset. (a) Regular LDA topic distribution marginalized over class, $\theta^{class}$, topics 
sorted in ascending order of class-specificity. (b) SLDA topic distribution marginalized over class, $\theta^{class}$, topics 
sorted in ascending order of class-specificity. (c) Factorized LDA topic distribution marginalized over class, 
$\theta^{class}$, topics sorted in ascending order of class-specificity, red line indicating partition between $\theta^p$ and $\theta^s$. 

As expected, the explicit noise model greatly improves classification on this dataset (Figures 10(a), 
10(b) and 10(c)): the factorized LDA reaches 81.25%, while a regular LDA reaches a classification 
rate of 34.38%, only slightly above chance, and SLDA, which is forced by the stronger supervision to 
represent all variation (where texture is dominating) in terms of class, achieves a result of 0% since 
the texture of the test image is not present in the training data of the same class. 

4.2 Text Classification 

We now evaluate the proposed model in a realistic text classification scenario. We use the 
standard R8 training and testing set from the Reuters 21578 dataset [26], which contains 5485 training 
documents and 2189 testing documents. The all-terms version of the data is used since we want to 
illustrate how our model deals with noise. 

The regular LDA, SLDA and factorized LDA models are trained with 20 topics, and parameter 
settings $\alpha = 0.5$ and $\pi = 0.1$. The topic distributions are shown in Figures 4(a), 4(b) and 4(c). 
The factorized class-private topic distribution $\theta^p$ (left of the red line in Figure 4(c)) is noticeably 
cleaner than the regular distribution $\theta^{class}$ (Figure 4(a)). In the factorized LDA, only $\theta^p$ contributes 
to the classification, while the shared topics $\theta^s$ (right of the red line in Figure 4(c)) are effectively 
disregarded since they appear with equal probability in all classes. The topics of the SLDA model 
are sparser (Figure 4(b)), but all topics are forced to be class-specific by the stronger supervision. 
Figure 11 shows examples of words in the topics of our factorized LDA. 

There is a significant classification improvement using the factorized topic space, from 74.63% with 
regular LDA (Figure 12(a)) and 63.75% with SLDA (Figure 12(b)) to 83.91% with factorized LDA 
(Figure 12(c)). It should be noted that SLDA has better performance on the classes "acq" and "earn", 
which have large amounts of data, and worse performance on "grain" and "ship", which have small 
amounts of data. The probable reason is that SLDA requires a larger amount of data than regular LDA, 
since it has more parameters to learn. 

4.3 Scene Classification 

We also evaluate the proposed model on a challenging natural scene dataset used in [8]. There are 
four classes: forest, mountain, open country and coast, with 100 training images and 50 test images 
per class. From each image, SIFT features on two different scales are densely extracted, and labeled 
according to a 192-word vocabulary learned from the features, as in [8]. 

(a) $\theta^{class}$ with regular LDA (b) $\theta^{class}$ with SLDA (c) $\theta^{class}$ with factorized LDA 

Figure 6: Action dataset. (a) Regular LDA topic distribution marginalized over class, $\theta^{class}$, topics sorted in 
ascending order of class-specificity. (b) SLDA topic distribution marginalized over class, $\theta^{class}$, topics sorted 
in ascending order of class-specificity. (c) Factorized LDA topic distribution marginalized over class, $\theta^{class}$, 
topics sorted in ascending order of class-specificity, red line indicating partition between $\theta^p$ and $\theta^s$. 

The regular LDA, SLDA, and factorized LDA models are trained with 20 topics, and parameter 
settings $\alpha = 0.5$ and $\pi = 0.1$. Figures 5(a), 5(b) and 5(c) show the respective topic distributions; 
notably, the class-specific topic space $\theta^p$ effectively used for classification in our factorized LDA 
only contains 8 topics, while 12 topics ($\theta^s$) are devoted to modeling structured noise. Thus, the 
factorized representation is notably sparser than a regular LDA representation, which gives the 
opportunity to save both storage space and computation time during classification - an important factor 
to take into account for large datasets. 

In addition to rendering a notably sparser data representation, the factorized LDA reaches a 
marginally higher classification rate than the regular LDA and SLDA: 84.50% for our model 
compared to 80.50% for the regular LDA and 84.00% for SLDA. All three rates are slightly 
better than the original implementation of the regular LDA [8], which reaches 76.0%. 



4.4 Action Classification 

We proceed to evaluate the methods on a dataset with more variation independent of class. The 
dataset consists of three actions from the KTH Action dataset [15]: boxing, handclapping and 
handwaving. There are 100 short video sequences of each action, which show 25 different people 
performing the action, recorded in four shooting conditions (zooming and panning of camera, different 
backgrounds). The shooting condition has a large influence on the motion in the video, as each zoom 
or panning motion adds global motion to the video and backgrounds contribute to the motion 
features as well. However, the variation in shooting condition is not at all correlated with action class 
in the dataset. Just as in the toy experiment above (but now in a more realistic setting), a large 
proportion of the data variation is thus independent of the action class. Due to the low signal-to-noise 
ratio, a topic model without factorization will have difficulties capturing the aspects of data relevant 
for discriminating activity class. 

The experiment was performed by separating out from the training data all the 25 video sequences of an action 
filmed with a certain shooting condition. The topic models were then trained with all other data, 
and evaluated with the 25 removed sequences. Hence, the particular combination of action and camera 
condition in the test data was not present in the training data. This was done for all actions in turn, 
and the result was averaged over actions. 

STIP features [15] were extracted from all sequences and clustered into a vocabulary of 128
spatio-temporal words. This representation was used to train the regular LDA, SLDA and factorized LDA
models with 10 topics, $\alpha = 0.1$ and $\pi = 0.1$.
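To make the bag-of-words representation concrete, the sketch below turns the descriptors of one sequence into a word-count histogram. It assumes the vocabulary (cluster centres) is already given; the clustering algorithm is not specified in this section, and the function names are hypothetical.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary centre closest (squared Euclidean) to the descriptor."""
    def sq_dist(center):
        return sum((d - c) ** 2 for d, c in zip(descriptor, center))
    return min(range(len(vocabulary)), key=lambda j: sq_dist(vocabulary[j]))


def bag_of_words(descriptors, vocabulary):
    """Word-count histogram of one video sequence over the vocabulary."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    return counts


# Toy usage: a 3-word vocabulary of 2-D descriptors (the experiment uses 128 words).
vocab = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(bag_of_words([(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.0, 1.2)], vocab))  # [1, 2, 1]
```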

Figure 6 shows the topic distributions corresponding to these three models. We can see that
factorized LDA is able to separate the class-dependent information (left of the red line) from the
class-independent information (right of the red line), which enables it to achieve better performance
on noisy data. The regular LDA does not share topics between classes, but it models all the
information, including the class-independent variation, with class-specific topics, which makes the
topics themselves noisy. The same holds for SLDA, which models the "noise" as useful topics.

Factorized LDA gives an accuracy of 65.22%, which is far better than both regular LDA, 38%, and
SLDA, 51.33%. This confirms that the findings of the toy experiment above apply to realistic
settings as well. Confusion matrices are shown in Figures 13(a), 13(b) and 13(c), respectively.

5 Conclusions



We present a factorized latent topic model, which explicitly represents aspects of the data which 
are not correlated with model state. Specifically, we train an LDA class model with an additional 
factorizing prior, which encourages topics to either be very class-specific or evenly shared among 
classes. The topic space $\theta$ is thus partitioned into one part $\theta^p$ whose topics are private to certain
classes, and another part $\theta^s$ with topics shared between classes. Only $\theta^p$ contributes effectively to
classification. 

Experiments show the factorized LDA model to give consistently better classification performance 
and sparser topic representations than both a regular LDA model [8] and SLDA [3]. Sparse repre- 
sentations are advantageous for large datasets since they save storage space and computation time 
during classification. 

Future work includes investigating the effect of this factorization prior on other topic models, such
as HDP, and integrating the prior into models with multiple data views, such as in [12, 28, 30].

References 

[1] F. R. Bach and M. I. Jordan. A probabilistic interpretation of canonical correlation analysis. 
Technical report, 2005. 

[2] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77-84, 2012. 

[3] D. M. Blei and J. D. McAuliffe. Supervised topic models. arXiv:1003.0783, 2010.

[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine 
Learning Research, 3:993-1022, 2003. 

[5] A. Damianou, C. H. Ek, M. Titsias, and N. D. Lawrence. Manifold relevance determination.
In ICML, 2012.

[6] C. H. Ek. Shared Gaussian Process Latent Variable Models, PhD Thesis, 2009. 

[7] C. H. Ek, J. Rihan, P. H. S. Torr, G. Rogez, and N. D. Lawrence. Ambiguity modeling in latent 
spaces. Machine Learning for Multimodal Interaction, 2008. 

[8] L. Fei-Fei and P. Perona. A bayesian hierarchical model for learning natural scene categories. 
In CVPR, 2005. 

[9] M. R. Gormley, M. Dredze, B. Van Durme, and J. Eisner. Shared components topic models.
In NAACL, 2012.

[10] G. Heinrich. Parameter estimation for text analysis. Technical report, 2005.

[11] T. Hofmann. Probabilistic latent semantic analysis. In UAI, 1999.

[12] Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. 
In ICCV, 2011. 

[13] A. Klami and S. Kaski. Generative models that discover dependencies between data sets. In 
IEEE Workshop on Machine Learning for Signal Processing, 2006. 

[14] A. Klami and S. Kaski. Probabilistic approach to detecting dependencies between data sets. 
Neurocomputing, 72(1-3):39-46, 2008.

[15] I. Laptev. On space-time interest points. IJCV, 64(2/3):107-123, 2005.

[16] N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process
latent variable models. Journal of Machine Learning Research, 6:1783-1816, 2005. 

[17] G. Leen and C. Fyfe. A Gaussian process latent variable model formulation of canonical 
correlation analysis. In European Symposium on Artificial Neural Networks, 2006. 

[18] G. Leen and C. Fyfe. Learning shared and separate features from two related data sets using 
GPLVM's. In Learning from Multiple Sources Workshop, Neural Information Processing, 
2008. 

[19] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization 
and segmentation. IJCV, 77(1):259-289, 2008.






[20] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 
2004. 

[21] R. Navaratnam, A. W. Fitzgibbon, and R. Cipolla. The joint manifold model for
semi-supervised multi-valued regression. In ICCV, 2007.

[22] J. Paisley, C. Wang, and D. Blei. The discrete infinite logistic normal distribution. arXiv
preprint arXiv:1103.4789, 2011.

[23] C. H. Papadimitriou, P. Raghavan, and H. Tamaki. Latent semantic indexing: a probabilistic 
analysis. Journal of Computer and System Sciences, (61):217-235, 2000. 

[24] J. P. W. Pluim, J. B. A. Maintz, and M. A. Viergever. Interpolation artefacts in mutual
information-based image registration. CVIU, 77(2):1077-3142, 2000.

[25] N. Rasiwasia, J. Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy, and N. Vasconcelos. A 
new approach to cross-modal multimedia retrieval. In MMM, 2010. 

[26] Reuters-21578. http://csmining.org/index.php/r52-and-r8-of-reuters-21578.html. 

[27] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the 
Royal Statistical Society: Series B (Statistical Methodology), 61(3):611-622, 1999.

[28] S. Virtanen, Y. Jia, A. Klami, and T. Darrell. Factorized multi-modal topic model. In UAI, 
2012. 

[29] C. Wang and D. M. Blei. Decoupling sparsity and smoothness in the discrete hierarchical 
dirichlet process. In NIPS, 2009. 

[30] C. Wang, D. M. Blei, and L. Fei-Fei. Simultaneous image classification and annotation. In 
CVPR, 2009. 

A Additional Figures 

Figures 7-11 would potentially be included in a journal version of this paper, and are appended for
clarity to the manuscript submitted for review to ICLR. All of these figures are referred to from the text.



[Figure 7 image: "Horse, cow or cat?", with the topic distribution of cats]



Figure 7: Example illustrating our approach (see the Introduction). The topics are factorized into those cor- 
related with class (in this example, green, shape features) and those not correlated with class (in this example, 
red, texture features). Without this factorization, the texture information, which is more prominent among the 
features extracted from the image, would obscure the shape information, which carries the actual information 
about class. 

(a) Static function (b) Dynamic function

Figure 8: Partitioning function A. (a) The static function A, Equation (4), which encourages topics with either
high or low class-specificity. (b) The dynamic energy optimization function A, Equation (0, for three different
values of H.




(a) Topic 2 (b) Topic 3 (c) Topic 4 (d) Topic 15 (e) Topic 14 (f) Topic 13 

Figure 9: Topics learned from toy object dataset. (a) Topic specific to the bulb class. Topic 1 and 2 are very
similar since they represent the same class, see Figure 3(a). (b) Topic specific to the car class. (c) Topic specific
to the duck class. (d-f) Shared/noise topics. The visual word images are generated by averaging the 10NN of
each word cluster center. We see a certain trend in class-specific topics (a-c) towards large/global structures 
(e.g., one edge across the window) while there is a trend in the class-shared topics (d-f) towards small/local 
structures (such as many stripes or dots). This indicates that class-specific topics represent shape to a higher 
degree, and that class-shared topics represent texture to a higher degree. 







(a) Regular (b) SLDA (c) Factorized 

Figure 10: Toy object dataset. (a) Confusion matrix for regular LDA: 34.38%. (b) Confusion matrix for 
SLDA: 0%. (c) Confusion matrix for factorized LDA: 81.25%. 

Topic 3


the of to and a said in on for at reuter s port were with spokesman shipping was strike union 


(a) Top 20 words from private (earn, crude, ship) topics 


Topic 20 


in billion the pct from to year of mln and a s marks last said were was group january february


Topic 19 


the to a in of said that and s he be is would for on not as was but have 


Topic 18 


national co pacific western security insurance first financial southern american texas united 
reuter power gas city service santa California federal 


(b) Top 20 words from shared/noise topics 



Figure 11: Topics learned from Reuters 21578 R8 dataset. (a) Topics specific to the earn, crude, ship classes.
(b) Shared/noise topics. The most class-specific topics contain words that clearly relate to that specific class,
whereas the shared topics contain general words that can be expected in texts of all classes. Interestingly, topic
19 contains words that are considered as stop words while topics 18 and 20 contain more unusual words. This
correlates with the probabilities of documents having the topics (i.e., the column sums in Figure 4(c)).

(c) Factorized

Figure 12: Reuters 21578 R8 dataset. (a) Confusion matrix for regular LDA: 74.63%. (b) Confusion matrix 
for SLDA: 63.75%. (c) Confusion matrix for factorized LDA: 83.91%. 

Figure 13: KTH action dataset. (a) Confusion matrix for regular LDA: 38%. (b) Confusion matrix for SLDA:
52.33%. (c) Confusion matrix for factorized LDA: 65.22%. 

B Generative Process and Gibbs Sampling

The generative process and the learning procedure of the proposed factorized LDA model are
described below. We adopt the same structure of presentation as in [12]. A detailed discussion
of Gibbs sampling in LDA is presented in [10], from which we adopt collapsed Gibbs sampling.

1. Sample the topic proportions $\theta \sim \mathrm{Dir}(\alpha) \prod_{k=1}^{K} A(k)$

2. For each word $w_{mn}$

(a) Choose a topic assignment $z_{mn}$

(b) Choose a word $w_{mn}$ from $\beta_{z_{mn}}$, which is the distribution of words under the topic
assignment $z_{mn}$.
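The two steps above can be sketched for the plain Dirichlet part of the prior. This is a minimal illustration, not the authors' implementation: the extra factor A(k) is left out because it acts through the Gibbs update rather than through direct sampling, and the function names and toy parameters are assumptions made here.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalising independent Gamma(alpha_k, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def generate_document(alpha, beta, n_words, rng):
    """Steps 1-2 for one document; beta[k] is the word distribution of topic k."""
    theta = sample_dirichlet(alpha, rng)            # step 1, Dirichlet part only
    topics = range(len(alpha))
    vocab = range(len(beta[0]))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]   # step 2(a): topic assignment
        w = rng.choices(vocab, weights=beta[z])[0]  # step 2(b): word from topic z
        words.append((z, w))
    return theta, words


# Toy usage: 3 topics over a 3-word vocabulary.
rng = random.Random(0)
beta = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
theta, words = generate_document([0.5, 0.5, 0.5], beta, 20, rng)
```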

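The additional prior over $\theta$ that we introduce presents itself, for a document with label $c$ and the
$k$-th topic, as an additional factor $\Gamma_{c,k}$ in the update equation of the Gibbs sampling,

$$p(z_i = k \mid z_{\neg i}, w, \alpha, \pi, \kappa) \propto \frac{n_{k,\neg i}^{(t)} + \pi_t}{\sum_{t=1}^{V} n_{k,\neg i}^{(t)} + \pi_t} \cdot \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\left[\sum_{k=1}^{K} n_m^{(k)} + \alpha_k\right] - 1} \cdot \Gamma_{c,k}. \tag{8}$$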
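As a rough illustration of how the extra class-dependent factor enters the collapsed Gibbs update, the sketch below evaluates the conditional for one token. It assumes the standard collapsed Gibbs form described in [10] multiplied by a per-topic factor; the function name, argument layout and toy counts are hypothetical.

```python
def topic_conditional(t, m, n_kt, n_mk, alpha, pi, gamma_c):
    """Normalised p(z_i = k | ...) for word t in document m.

    n_kt[k][t]: count of word t in topic k, with the current token removed.
    n_mk[m][k]: count of topic k in document m, with the current token removed.
    gamma_c[k]: extra factor for the document's class and topic k.
    """
    K = len(alpha)
    # With the current token removed, this sum equals the "[...] - 1" denominator
    # of the document term; it is the same for every k.
    doc_norm = sum(n_mk[m][k] + alpha[k] for k in range(K))
    weights = []
    for k in range(K):
        word_term = (n_kt[k][t] + pi[t]) / sum(n + p for n, p in zip(n_kt[k], pi))
        doc_term = (n_mk[m][k] + alpha[k]) / doc_norm
        weights.append(word_term * doc_term * gamma_c[k])
    total = sum(weights)
    return [w / total for w in weights]


# Toy usage: two topics, two words, one document, no extra preference.
n_kt = [[3, 0], [0, 3]]
n_mk = [[2, 2]]
probs = topic_conditional(0, 0, n_kt, n_mk, [0.1, 0.1], [0.1, 0.1], [1.0, 1.0])
```

Setting every entry of `gamma_c` to 1 recovers the ordinary LDA update, which makes the effect of the factor easy to isolate.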

In the updating procedure for $\Gamma_{c,k}$ a gradient step is carried out by measuring the possible change
with respect to the prior, $\Delta A(k) = A^{assume}(k) - A(k)$, where $A^{assume}(k)$ is the evaluation of Eq. (4)
for the current sample. Finally, the change is mapped to make it a proper regularizer, so that
$\Delta A(k) \in (-1, 1)$. Thus the factor takes the following form,

$$\Gamma_k = 1 + \Delta A(k). \tag{9}$$

To introduce additional control over the learning process we introduce a parameter $\kappa$ that controls
the contribution of this additional term,

$$\Gamma_k = \begin{cases} 1 + |\Delta A(k)|^{\kappa}, & \text{if } \Delta A(k) > 0 \\ 1 - |\Delta A(k)|^{\kappa}, & \text{if } \Delta A(k) < 0. \end{cases} \tag{10}$$

In our experiments we used a constant value between 0.2 and 0.35.
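The piecewise factor with the exponent parameter can be written as a small function. This is a sketch only; the function name is hypothetical, and returning 1 when the change is exactly zero is an assumption, since that case is not covered by the definition above.

```python
def gamma_factor(delta_a, kappa):
    """Factor 1 + |dA|^kappa if dA > 0, and 1 - |dA|^kappa if dA < 0."""
    if not -1.0 < delta_a < 1.0:
        raise ValueError("delta_a is expected to lie in (-1, 1)")
    if delta_a > 0:
        return 1.0 + abs(delta_a) ** kappa
    if delta_a < 0:
        return 1.0 - abs(delta_a) ** kappa
    return 1.0  # assumption: a zero change leaves the update untouched


# With kappa < 1, small changes are amplified: |0.0016| ** 0.25 is 0.2.
up = gamma_factor(0.0016, 0.25)     # about 1.2
down = gamma_factor(-0.0016, 0.25)  # about 0.8
```

Because the change lies in (-1, 1), the factor stays within (0, 2) for any positive exponent, so it never flips the sign of the sampling weight.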

The regularization term $\Gamma_{c,k}$ in Eq. (8) needs to be updated every iteration. We will now proceed to
describe the specifics of these updates. The change $\Delta A$ induced by adding one assignment to
topic $k$ can be written as,

$$\Delta A_{c,k} = A^{assume}_{c,k} - A_{c,k} = \left(H^{assume}_{c,k}\right)^2 - H^{assume}_{c,k} - \left(H_{c,k}\right)^2 + H_{c,k}, \tag{11}$$
^ c=C 

H ^ = ~i JTTrA Yl ^c,klogy c ,k, (12) 



where, 



EK IK) i 1 -1 / '"^ 
= fc=1 n +*„]-! 

and by assuming adding 1 to n m> ^, we get, 

rrassume 1 \ A vyy assume 7 __/\T f assume \ 

c,/c = ~ (log(l/C) ^ ,fc ,/c 

m=l I r V K ' (fc) , 1 J ' °rnc 
^assume LZ^fc=i ^ ^"fcj /ir\ 

$\delta_{mc}$ is the Kronecker delta function. This is used only to observe the possible effect on the $H_{c,k}$
measurement, and is not used to update any sufficient statistics. Plugging Equations (10), (11) and (13)
back into Equation (8), the final update equation is obtained.
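The change of the prior in Eq. (11) depends only on the entropy measurement before and after the tentative assignment. The sketch below evaluates just that difference; the function name is hypothetical, and it is assumed here that both measurements are normalised to lie in [0, 1].

```python
def delta_a(h_current, h_assume):
    """Eq. (11): (H_assume^2 - H_assume) - (H^2 - H)."""
    return (h_assume ** 2 - h_assume) - (h_current ** 2 - h_current)


# Moving the measurement away from 0.5 (towards "very private" or "evenly
# shared") gives a positive change; moving towards 0.5 gives a negative one.
away = delta_a(0.5, 0.4)     # about +0.01
towards = delta_a(0.4, 0.5)  # about -0.01
```

Under the [0, 1] assumption, $H^2 - H$ lies in $[-0.25, 0]$, so the magnitude of the change never exceeds 0.25 and the requirement that it lies in (-1, 1) holds.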

Figure 14: Illustration of $\theta$ and $\theta^{class}$. $K=10$ is used here as an example.



Figure 14 gives an intuitive plot of the procedure. We can treat $\theta$ as a matrix (left) where each
row represents the topic distribution of a document. We introduce $\theta^{class}$ as a representation of
$\theta$ structured by class, as shown in the right image. For traditional LDA, the update operates on
each row. Other related work, proposed for solving different problems, for example [12], considers
the distance between rows during updating. In our approach, we take the column-wise
distribution into consideration. This is what allows us to softly factorize the topic space into
class-dependent and class-independent parts. The experiments support our hypothesis that modelling "signal"
and "noise" separately describes real-world data better and achieves better performance.
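The column-wise view can be illustrated with a small sketch. It is not the authors' exact definition: it assumes that the class-structured matrix is obtained by averaging document rows within each class, and that class-specificity of a topic is measured by the entropy of its column over classes, divided by log C. All names are hypothetical.

```python
import math


def class_topic_matrix(theta, labels, n_classes):
    """Average the per-document topic rows of theta within each class.

    Assumes every class has at least one document.
    """
    K = len(theta[0])
    sums = [[0.0] * K for _ in range(n_classes)]
    counts = [0] * n_classes
    for row, c in zip(theta, labels):
        counts[c] += 1
        for k, v in enumerate(row):
            sums[c][k] += v
    return [[v / counts[c] for v in sums[c]] for c in range(n_classes)]


def column_entropy(theta_class, k):
    """Entropy of topic k over classes, scaled by log(C) to lie in [0, 1].

    Assumes topic k has non-zero mass in at least one class.
    """
    col = [row[k] for row in theta_class]
    total = sum(col)
    probs = [v / total for v in col if v > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(theta_class))


# Toy usage: topic 0 is used by class 0 only, topic 2 is shared evenly.
theta = [[0.5, 0.0, 0.5], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
theta_class = class_topic_matrix(theta, [0, 0, 1], 2)
private = column_entropy(theta_class, 0)  # zero: fully class-specific
shared = column_entropy(theta_class, 2)   # one: evenly shared
```

A measurement near 0 marks a topic as private to a class and a measurement near 1 marks it as shared, which are the two regimes the factorizing prior encourages.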